DataStories
Abstract
This paper built two sentiment analysis systems based on LSTM with two kinds of attention mechanism on top of word embeddings pre-trained on a big collection of Twitter messages. They ranked 1st on the subtask A. They built their own text processing tools which are open for commuinity.1
github.com/cbaziotis/ekphrasis
Overview
Except the word tokenization and normalization. They also perfrom spell correction.
Comment
This paper uses LSTM to do the sentiment analysis. It is a mainstream method of the last year. I think it is also a suitable model for NLP tasks. Because it keep the order information of the sentences.
BB_twtr
This model uses 10 CNNs and 10 LSTMs to build an ensemble neural model. It achieves the same score with DataStories. But I think it still remains a lot of potential improvement.
LIA
The author also achieved a good result in SemEval2016. The system submitted in last year is also a ensemble model.
The Ensembel Model
The system is an ensemble of DNN: CNN and RNN-LSTM.
Same point
They also found that the weight initialization can lead to a high variation of performance. Therefore they also trained 20 models and selected the one that obtained the best result on the development dataset.
Sentiment Embedding (negative sampling)
The negative-sampling approach is an efficient way of computing softmax. Instead of selecting random words, as is usual for this technique, they chose to select words with opposite polarities.
Sentence Level Feature
- Lexicons
- Emoticons
- All-caps
- Elongated Units
- Punctuation
Mimic Model
They proposed a training approach consist of a teacher model (state of the art system) and a student model(mimic model). The mimic model is not trained on the original labels, but it is trained to learn the targets predicted by the teacher model.
In this paper, they say the student model can outperform the teacher model. They give three reasons:
- If some labels have errors, the teacher model may eliminate some of these errors thus making it learning easier for the student.
- Learning from the original, hard 0/1 labels can be more difficult than learning from a teacher’s conditional probabilities; but the mimic model sees non-zero targets for most outputs on most training cases, and the teacher can spread uncertainty over multiple outputs for difficult cases. The uncertainty from the teacher model is more informative to the student model than the original 0/1 labels.
- The mimic model can be seen as a form of regularization that helps prevent overfitting the model.
But I think the reason is not reasonable.
CNN vs RNN-LSTM
In this paper, the experiment shows the mimic CNN performs better than RNN-LSTM. But they do not test the mimic RNN-LSTM.